Based on laboratory tests collected from suspected cases, predict whether a patient is likely to test positive or negative for COVID-19 and identify the factors that influence the result. Also, provide recommendations to the hospital on how it can better manage the admission of patients to the general ward, semi-intensive unit, or intensive care unit.
One motivation for this problem is that, in the context of an overwhelmed health system with limited capacity to perform SARS-CoV-2 tests, testing every case with mild symptoms such as cough, cold, or low-grade fever would be impractical, and results could be delayed as more and more people with mild complaints seek testing. Setting up clear criteria for who gets tested is therefore important, both to return results quickly and to slow the spread of the virus.
The broader goals are to slow the virus's spread across the country and to minimize the number of severe hospitalizations.
How the hospital manages admissions to the general ward, semi-intensive unit, or intensive care unit also matters: the disease spreads rapidly and a large number of cases arrive every day, yet the number of hospital beds is limited, so admission decisions must be studied carefully to save the maximum number of lives.
# This will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To tune model, get different metric scores, and split data
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,  # removed in scikit-learn 1.2; use ConfusionMatrixDisplay on newer versions
)
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To impute missing values
from sklearn.impute import SimpleImputer
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
# To suppress scientific notation in dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
The nb_black extension is already loaded. To reload it, use: %reload_ext nb_black
# Read the Excel file once and save it as CSV for faster reloading
df = pd.read_excel(r"/Users/himadrisamanta/Downloads/covid19_dataset.xlsx")
df.to_csv(
    r"/Users/himadrisamanta/Downloads/covid19_dataset.csv", index=None, header=True
)
data1 = pd.read_csv("covid19_dataset.csv")
pd.set_option("display.max_columns", None)
data1.sample(5, random_state=1)
| | Patient ID | Patient age quantile | SARS-Cov-2 exam result | Patient addmited to regular ward (1=yes, 0=no) | Patient addmited to semi-intensive unit (1=yes, 0=no) | Patient addmited to intensive care unit (1=yes, 0=no) | Hematocrit | Hemoglobin | Platelets | Mean platelet volume | Red blood Cells | Lymphocytes | Mean corpuscular hemoglobin concentration (MCHC) | Leukocytes | Basophils | Mean corpuscular hemoglobin (MCH) | Eosinophils | Mean corpuscular volume (MCV) | Monocytes | Red blood cell distribution width (RDW) | Serum Glucose | Respiratory Syncytial Virus | Influenza A | Influenza B | Parainfluenza 1 | CoronavirusNL63 | Rhinovirus/Enterovirus | Mycoplasma pneumoniae | Coronavirus HKU1 | Parainfluenza 3 | Chlamydophila pneumoniae | Adenovirus | Parainfluenza 4 | Coronavirus229E | CoronavirusOC43 | Inf A H1N1 2009 | Bordetella pertussis | Metapneumovirus | Parainfluenza 2 | Neutrophils | Urea | Proteina C reativa mg/dL | Creatinine | Potassium | Sodium | Influenza B, rapid test | Influenza A, rapid test | Alanine transaminase | Aspartate transaminase | Gamma-glutamyltransferase | Total Bilirubin | Direct Bilirubin | Indirect Bilirubin | Alkaline phosphatase | Ionized calcium | Strepto A | Magnesium | pCO2 (venous blood gas analysis) | Hb saturation (venous blood gas analysis) | Base excess (venous blood gas analysis) | pO2 (venous blood gas analysis) | Fio2 (venous blood gas analysis) | Total CO2 (venous blood gas analysis) | pH (venous blood gas analysis) | HCO3 (venous blood gas analysis) | Rods # | Segmented | Promyelocytes | Metamyelocytes | Myelocytes | Myeloblasts | Urine - Esterase | Urine - Aspect | Urine - pH | Urine - Hemoglobin | Urine - Bile pigments | Urine - Ketone Bodies | Urine - Nitrite | Urine - Density | Urine - Urobilinogen | Urine - Protein | Urine - Sugar | Urine - Leukocytes | Urine - Crystals | Urine - Red blood cells | Urine - Hyaline cylinders | Urine - Granular cylinders | Urine - Yeasts | Urine - Color | Partial thromboplastin time (PTT) | Relationship (Patient/Normal) | International normalized ratio (INR) | Lactic Dehydrogenase | Prothrombin time (PT), Activity | Vitamin B12 | Creatine phosphokinase (CPK) | Ferritin | Arterial Lactic Acid | Lipase dosage | D-Dimer | Albumin | Hb saturation (arterial blood gases) | pCO2 (arterial blood gas analysis) | Base excess (arterial blood gas analysis) | pH (arterial blood gas analysis) | Total CO2 (arterial blood gas analysis) | HCO3 (arterial blood gas analysis) | pO2 (arterial blood gas analysis) | Arteiral Fio2 | Phosphor | ctO2 (arterial blood gas analysis) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4441 | b7c8bff333721c1 | 12 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | negative | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1603 | 484d8a9c71f01d2 | 1 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1206 | 1f3c363371d0462 | 10 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1586 | 938004044cac19f | 6 | positive | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2730 | 2e4ddd5e391680f | 16 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
print(f"There are {data1.shape[0]} rows and {data1.shape[1]} columns in data.")
There are 5644 rows and 111 columns in data.
data1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5644 entries, 0 to 5643 Columns: 111 entries, Patient ID to ctO2 (arterial blood gas analysis) dtypes: float64(70), int64(4), object(37) memory usage: 4.8+ MB
data = data1.copy()
data.duplicated().sum()
0
There are no duplicate rows in the dataset.
data.isnull().sum()
Patient ID 0 Patient age quantile 0 SARS-Cov-2 exam result 0 Patient addmited to regular ward (1=yes, 0=no) 0 Patient addmited to semi-intensive unit (1=yes, 0=no) 0 Patient addmited to intensive care unit (1=yes, 0=no) 0 Hematocrit 5041 Hemoglobin 5041 Platelets 5042 Mean platelet volume 5045 Red blood Cells 5042 Lymphocytes 5042 Mean corpuscular hemoglobin concentration (MCHC) 5042 Leukocytes 5042 Basophils 5042 Mean corpuscular hemoglobin (MCH) 5042 Eosinophils 5042 Mean corpuscular volume (MCV) 5042 Monocytes 5043 Red blood cell distribution width (RDW) 5042 Serum Glucose 5436 Respiratory Syncytial Virus 4290 Influenza A 4290 Influenza B 4290 Parainfluenza 1 4292 CoronavirusNL63 4292 Rhinovirus/Enterovirus 4292 Mycoplasma pneumoniae 5644 Coronavirus HKU1 4292 Parainfluenza 3 4292 Chlamydophila pneumoniae 4292 Adenovirus 4292 Parainfluenza 4 4292 Coronavirus229E 4292 CoronavirusOC43 4292 Inf A H1N1 2009 4292 Bordetella pertussis 4292 Metapneumovirus 4292 Parainfluenza 2 4292 Neutrophils 5131 Urea 5247 Proteina C reativa mg/dL 5138 Creatinine 5220 Potassium 5273 Sodium 5274 Influenza B, rapid test 4824 Influenza A, rapid test 4824 Alanine transaminase 5419 Aspartate transaminase 5418 Gamma-glutamyltransferase 5491 Total Bilirubin 5462 Direct Bilirubin 5462 Indirect Bilirubin 5462 Alkaline phosphatase 5500 Ionized calcium 5594 Strepto A 5312 Magnesium 5604 pCO2 (venous blood gas analysis) 5508 Hb saturation (venous blood gas analysis) 5508 Base excess (venous blood gas analysis) 5508 pO2 (venous blood gas analysis) 5508 Fio2 (venous blood gas analysis) 5643 Total CO2 (venous blood gas analysis) 5508 pH (venous blood gas analysis) 5508 HCO3 (venous blood gas analysis) 5508 Rods # 5547 Segmented 5547 Promyelocytes 5547 Metamyelocytes 5547 Myelocytes 5547 Myeloblasts 5547 Urine - Esterase 5584 Urine - Aspect 5574 Urine - pH 5574 Urine - Hemoglobin 5574 Urine - Bile pigments 5574 Urine - Ketone Bodies 5587 Urine - Nitrite 5643 Urine - Density 5574 Urine - Urobilinogen 5575 Urine - Protein 5584 Urine - Sugar 5644 Urine - Leukocytes 5574 Urine - Crystals 5574 Urine - Red blood cells 5574 Urine - Hyaline cylinders 5577 Urine - Granular cylinders 5575 Urine - Yeasts 5574 Urine - Color 5574 Partial thromboplastin time (PTT) 5644 Relationship (Patient/Normal) 5553 International normalized ratio (INR) 5511 Lactic Dehydrogenase 5543 Prothrombin time (PT), Activity 5644 Vitamin B12 5641 Creatine phosphokinase (CPK) 5540 Ferritin 5621 Arterial Lactic Acid 5617 Lipase dosage 5636 D-Dimer 5644 Albumin 5631 Hb saturation (arterial blood gases) 5617 pCO2 (arterial blood gas analysis) 5617 Base excess (arterial blood gas analysis) 5617 pH (arterial blood gas analysis) 5617 Total CO2 (arterial blood gas analysis) 5617 HCO3 (arterial blood gas analysis) 5617 pO2 (arterial blood gas analysis) 5617 Arteiral Fio2 5624 Phosphor 5624 ctO2 (arterial blood gas analysis) 5617 dtype: int64
A large number of columns have a very high proportion of missing values.
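Before deciding how to impute or drop columns, it helps to express missingness as a percentage per column. A small illustrative sketch on toy data (the column names and values here are stand-ins, not the project dataset):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the sparse lab columns (illustrative values only)
toy = pd.DataFrame(
    {
        "Hematocrit": [0.2, np.nan, np.nan, 0.5],
        "Hemoglobin": [np.nan, np.nan, np.nan, np.nan],
    }
)

# Share of missing values per column, as a percentage, worst first
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)  # Hemoglobin 100.0, Hematocrit 50.0
```

Columns near 100% missing carry little signal and are candidates for removal; moderately sparse columns are candidates for imputation.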
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Patient age quantile | 5644.000 | 9.318 | 5.778 | 0.000 | 4.000 | 9.000 | 14.000 | 19.000 |
| Patient addmited to regular ward (1=yes, 0=no) | 5644.000 | 0.014 | 0.117 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Patient addmited to semi-intensive unit (1=yes, 0=no) | 5644.000 | 0.009 | 0.094 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Patient addmited to intensive care unit (1=yes, 0=no) | 5644.000 | 0.007 | 0.085 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Hematocrit | 603.000 | -0.000 | 1.001 | -4.501 | -0.519 | 0.053 | 0.717 | 2.663 |
| Hemoglobin | 603.000 | -0.000 | 1.001 | -4.346 | -0.586 | 0.040 | 0.730 | 2.672 |
| Platelets | 602.000 | -0.000 | 1.001 | -2.552 | -0.605 | -0.122 | 0.531 | 9.532 |
| Mean platelet volume | 599.000 | 0.000 | 1.001 | -2.458 | -0.662 | -0.102 | 0.684 | 3.713 |
| Red blood Cells | 602.000 | 0.000 | 1.001 | -3.971 | -0.568 | 0.014 | 0.666 | 3.646 |
| Lymphocytes | 602.000 | -0.000 | 1.001 | -1.865 | -0.731 | -0.014 | 0.598 | 3.764 |
| Mean corpuscular hemoglobin concentration (MCHC) | 602.000 | 0.000 | 1.001 | -5.432 | -0.552 | -0.055 | 0.642 | 3.331 |
| Leukocytes | 602.000 | 0.000 | 1.001 | -2.020 | -0.637 | -0.213 | 0.454 | 4.522 |
| Basophils | 602.000 | -0.000 | 1.001 | -1.140 | -0.529 | -0.224 | 0.387 | 11.078 |
| Mean corpuscular hemoglobin (MCH) | 602.000 | -0.000 | 1.001 | -5.938 | -0.501 | 0.126 | 0.596 | 4.099 |
| Eosinophils | 602.000 | 0.000 | 1.001 | -0.836 | -0.667 | -0.330 | 0.344 | 8.351 |
| Mean corpuscular volume (MCV) | 602.000 | -0.000 | 1.001 | -5.102 | -0.515 | 0.066 | 0.627 | 3.411 |
| Monocytes | 601.000 | -0.000 | 1.001 | -2.164 | -0.614 | -0.115 | 0.489 | 4.533 |
| Red blood cell distribution width (RDW) | 602.000 | 0.000 | 1.001 | -1.598 | -0.625 | -0.183 | 0.348 | 6.982 |
| Serum Glucose | 208.000 | 0.000 | 1.002 | -1.110 | -0.504 | -0.292 | 0.139 | 7.006 |
| Mycoplasma pneumoniae | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Neutrophils | 513.000 | 0.000 | 1.001 | -3.340 | -0.652 | -0.054 | 0.684 | 2.536 |
| Urea | 397.000 | -0.000 | 1.001 | -1.630 | -0.588 | -0.142 | 0.454 | 11.247 |
| Proteina C reativa mg/dL | 506.000 | 0.000 | 1.001 | -0.535 | -0.514 | -0.394 | 0.032 | 8.027 |
| Creatinine | 424.000 | -0.000 | 1.001 | -2.390 | -0.632 | -0.081 | 0.513 | 5.054 |
| Potassium | 371.000 | 0.000 | 1.001 | -2.283 | -0.800 | -0.059 | 0.683 | 3.402 |
| Sodium | 370.000 | 0.000 | 1.001 | -5.247 | -0.575 | 0.144 | 0.503 | 4.097 |
| Alanine transaminase | 225.000 | 0.000 | 1.002 | -0.642 | -0.449 | -0.284 | 0.102 | 7.931 |
| Aspartate transaminase | 226.000 | -0.000 | 1.002 | -0.704 | -0.433 | -0.278 | 0.031 | 7.231 |
| Gamma-glutamyltransferase | 153.000 | -0.000 | 1.003 | -0.477 | -0.376 | -0.286 | -0.061 | 8.508 |
| Total Bilirubin | 182.000 | -0.000 | 1.003 | -1.093 | -0.787 | -0.175 | 0.131 | 5.029 |
| Direct Bilirubin | 182.000 | 0.000 | 1.003 | -1.170 | -0.586 | -0.003 | -0.003 | 6.996 |
| Indirect Bilirubin | 182.000 | 0.000 | 1.003 | -0.771 | -0.771 | -0.279 | 0.214 | 6.615 |
| Alkaline phosphatase | 144.000 | -0.000 | 1.003 | -0.959 | -0.609 | -0.358 | 0.054 | 3.883 |
| Ionized calcium | 50.000 | 0.000 | 1.010 | -2.100 | -0.729 | 0.060 | 0.558 | 3.549 |
| Magnesium | 40.000 | -0.000 | 1.013 | -2.191 | -0.558 | -0.014 | 0.531 | 2.164 |
| pCO2 (venous blood gas analysis) | 136.000 | -0.000 | 1.004 | -2.705 | -0.547 | 0.014 | 0.619 | 5.680 |
| Hb saturation (venous blood gas analysis) | 136.000 | 0.000 | 1.004 | -2.296 | -0.803 | 0.090 | 0.817 | 1.708 |
| Base excess (venous blood gas analysis) | 136.000 | -0.000 | 1.004 | -3.669 | -0.402 | 0.080 | 0.554 | 3.357 |
| pO2 (venous blood gas analysis) | 136.000 | -0.000 | 1.004 | -1.634 | -0.694 | -0.213 | 0.483 | 3.775 |
| Fio2 (venous blood gas analysis) | 1.000 | 0.000 | NaN | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Total CO2 (venous blood gas analysis) | 136.000 | -0.000 | 1.004 | -2.598 | -0.495 | 0.104 | 0.542 | 3.021 |
| pH (venous blood gas analysis) | 136.000 | 0.000 | 1.004 | -4.773 | -0.526 | -0.091 | 0.490 | 2.790 |
| HCO3 (venous blood gas analysis) | 136.000 | -0.000 | 1.004 | -2.645 | -0.529 | 0.101 | 0.529 | 2.782 |
| Rods # | 97.000 | 0.000 | 1.005 | -0.624 | -0.624 | -0.624 | 0.326 | 3.496 |
| Segmented | 97.000 | -0.000 | 1.005 | -2.264 | -0.673 | 0.176 | 0.919 | 1.502 |
| Promyelocytes | 97.000 | 0.000 | 1.005 | -0.102 | -0.102 | -0.102 | -0.102 | 9.798 |
| Metamyelocytes | 97.000 | 0.000 | 1.005 | -0.316 | -0.316 | -0.316 | -0.316 | 6.136 |
| Myelocytes | 97.000 | 0.000 | 1.005 | -0.233 | -0.233 | -0.233 | -0.233 | 6.551 |
| Myeloblasts | 97.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Urine - Density | 70.000 | -0.000 | 1.007 | -1.757 | -0.764 | -0.055 | 0.655 | 2.499 |
| Urine - Sugar | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Red blood cells | 70.000 | 0.000 | 1.007 | -0.202 | -0.202 | -0.194 | -0.166 | 7.822 |
| Partial thromboplastin time (PTT) | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Relationship (Patient/Normal) | 91.000 | -0.000 | 1.006 | -2.351 | -0.497 | -0.089 | 0.453 | 4.706 |
| International normalized ratio (INR) | 133.000 | -0.000 | 1.004 | -1.797 | -0.665 | -0.156 | 0.297 | 7.370 |
| Lactic Dehydrogenase | 101.000 | 0.000 | 1.005 | -1.359 | -0.700 | -0.331 | 0.473 | 2.950 |
| Prothrombin time (PT), Activity | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Vitamin B12 | 3.000 | -0.000 | 1.225 | -1.401 | -0.435 | 0.531 | 0.700 | 0.870 |
| Creatine phosphokinase (CPK) | 104.000 | -0.000 | 1.005 | -0.516 | -0.377 | -0.225 | 0.035 | 7.216 |
| Ferritin | 23.000 | 0.000 | 1.022 | -0.628 | -0.560 | -0.358 | 0.120 | 3.846 |
| Arterial Lactic Acid | 27.000 | -0.000 | 1.019 | -1.091 | -0.695 | -0.298 | 0.230 | 3.004 |
| Lipase dosage | 8.000 | -0.000 | 1.069 | -1.192 | -0.547 | -0.351 | 0.182 | 1.725 |
| D-Dimer | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Albumin | 13.000 | -0.000 | 1.041 | -2.290 | -0.539 | -0.038 | 0.462 | 1.963 |
| Hb saturation (arterial blood gases) | 27.000 | -0.000 | 1.019 | -2.000 | -1.123 | 0.268 | 0.738 | 1.337 |
| pCO2 (arterial blood gas analysis) | 27.000 | 0.000 | 1.019 | -1.245 | -0.535 | -0.212 | 0.023 | 3.237 |
| Base excess (arterial blood gas analysis) | 27.000 | -0.000 | 1.019 | -3.083 | -0.331 | -0.012 | 0.666 | 1.703 |
| pH (arterial blood gas analysis) | 27.000 | 0.000 | 1.019 | -3.569 | -0.092 | 0.294 | 0.512 | 1.043 |
| Total CO2 (arterial blood gas analysis) | 27.000 | -0.000 | 1.019 | -2.926 | -0.512 | 0.077 | 0.439 | 1.940 |
| HCO3 (arterial blood gas analysis) | 27.000 | 0.000 | 1.019 | -2.986 | -0.540 | 0.056 | 0.509 | 2.029 |
| pO2 (arterial blood gas analysis) | 27.000 | -0.000 | 1.019 | -1.176 | -0.817 | -0.160 | 0.450 | 2.205 |
| Arteiral Fio2 | 20.000 | 0.000 | 1.026 | -1.533 | -0.121 | -0.012 | -0.012 | 2.842 |
| Phosphor | 20.000 | 0.000 | 1.026 | -1.481 | -0.553 | -0.138 | 0.276 | 2.862 |
| ctO2 (arterial blood gas analysis) | 27.000 | 0.000 | 1.019 | -2.900 | -0.485 | 0.183 | 0.594 | 1.827 |
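Notably, the continuous lab features summarize to mean ≈ 0 and std ≈ 1, which suggests the dataset was released pre-standardized (z-scored). A minimal sketch, using `StandardScaler` on made-up raw values (not the actual source measurements), of how that standardization behaves; the ~1.001 figures above are consistent with pandas' `describe` reporting the sample standard deviation (`ddof=1`) of data scaled with the population formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative raw values only -- not the actual source measurements
raw = np.array([[38.0], [42.0], [45.0], [40.0]])

scaled = StandardScaler().fit_transform(raw)

print(scaled.mean())  # ~0 after centering
print(scaled.std())   # 1 with the population std (ddof=0)
```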
data.head()
| | Patient ID | Patient age quantile | SARS-Cov-2 exam result | Patient addmited to regular ward (1=yes, 0=no) | Patient addmited to semi-intensive unit (1=yes, 0=no) | Patient addmited to intensive care unit (1=yes, 0=no) | Hematocrit | Hemoglobin | Platelets | Mean platelet volume | Red blood Cells | Lymphocytes | Mean corpuscular hemoglobin concentration (MCHC) | Leukocytes | Basophils | Mean corpuscular hemoglobin (MCH) | Eosinophils | Mean corpuscular volume (MCV) | Monocytes | Red blood cell distribution width (RDW) | Serum Glucose | Respiratory Syncytial Virus | Influenza A | Influenza B | Parainfluenza 1 | CoronavirusNL63 | Rhinovirus/Enterovirus | Mycoplasma pneumoniae | Coronavirus HKU1 | Parainfluenza 3 | Chlamydophila pneumoniae | Adenovirus | Parainfluenza 4 | Coronavirus229E | CoronavirusOC43 | Inf A H1N1 2009 | Bordetella pertussis | Metapneumovirus | Parainfluenza 2 | Neutrophils | Urea | Proteina C reativa mg/dL | Creatinine | Potassium | Sodium | Influenza B, rapid test | Influenza A, rapid test | Alanine transaminase | Aspartate transaminase | Gamma-glutamyltransferase | Total Bilirubin | Direct Bilirubin | Indirect Bilirubin | Alkaline phosphatase | Ionized calcium | Strepto A | Magnesium | pCO2 (venous blood gas analysis) | Hb saturation (venous blood gas analysis) | Base excess (venous blood gas analysis) | pO2 (venous blood gas analysis) | Fio2 (venous blood gas analysis) | Total CO2 (venous blood gas analysis) | pH (venous blood gas analysis) | HCO3 (venous blood gas analysis) | Rods # | Segmented | Promyelocytes | Metamyelocytes | Myelocytes | Myeloblasts | Urine - Esterase | Urine - Aspect | Urine - pH | Urine - Hemoglobin | Urine - Bile pigments | Urine - Ketone Bodies | Urine - Nitrite | Urine - Density | Urine - Urobilinogen | Urine - Protein | Urine - Sugar | Urine - Leukocytes | Urine - Crystals | Urine - Red blood cells | Urine - Hyaline cylinders | Urine - Granular cylinders | Urine - Yeasts | Urine - Color | Partial thromboplastin time (PTT) | Relationship (Patient/Normal) | International normalized ratio (INR) | Lactic Dehydrogenase | Prothrombin time (PT), Activity | Vitamin B12 | Creatine phosphokinase (CPK) | Ferritin | Arterial Lactic Acid | Lipase dosage | D-Dimer | Albumin | Hb saturation (arterial blood gases) | pCO2 (arterial blood gas analysis) | Base excess (arterial blood gas analysis) | pH (arterial blood gas analysis) | Total CO2 (arterial blood gas analysis) | HCO3 (arterial blood gas analysis) | pO2 (arterial blood gas analysis) | Arteiral Fio2 | Phosphor | ctO2 (arterial blood gas analysis) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44477f75e8169d2 | 13 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 126e9dd13932f68 | 17 | negative | 0 | 0 | 0 | 0.237 | -0.022 | -0.517 | 0.011 | 0.102 | 0.318 | -0.951 | -0.095 | -0.224 | -0.292 | 1.482 | 0.166 | 0.358 | -0.625 | -0.141 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | NaN | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | -0.619 | 1.198 | -0.148 | 2.090 | -0.306 | 0.863 | negative | negative | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | a46b4402a0e5696 | 8 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | f7d619a94f97c45 | 5 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | d9e41465789c2b5 | 15 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | not_detected | not_detected | not_detected | not_detected | not_detected | detected | NaN | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
data.drop(["Patient ID"], axis=1, inplace=True)
# Strip special characters and spaces from the column names
for ch in ["#", "-", "=", "(", ")", "<", ",", " ", "/"]:
    data.columns = [c.replace(ch, "") for c in data.columns]
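The effect of the renaming can be sanity-checked on a single header; `clean` below is a throwaway helper introduced here only for illustration, mirroring the replacements applied to `data.columns`:

```python
def clean(name: str) -> str:
    # Remove the same special characters stripped from the column names above
    for ch in "#-=()<, /":
        name = name.replace(ch, "")
    return name

print(clean("Patient addmited to regular ward (1=yes, 0=no)"))
# -> Patientaddmitedtoregularward1yes0no
```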
data.sample(5, random_state=1)
| | Patientagequantile | SARSCov2examresult | Patientaddmitedtoregularward1yes0no | Patientaddmitedtosemiintensiveunit1yes0no | Patientaddmitedtointensivecareunit1yes0no | Hematocrit | Hemoglobin | Platelets | Meanplateletvolume | RedbloodCells | Lymphocytes | Meancorpuscularhemoglobinconcentration MCHC | Leukocytes | Basophils | MeancorpuscularhemoglobinMCH | Eosinophils | MeancorpuscularvolumeMCV | Monocytes | RedbloodcelldistributionwidthRDW | SerumGlucose | RespiratorySyncytialVirus | InfluenzaA | InfluenzaB | Parainfluenza1 | CoronavirusNL63 | RhinovirusEnterovirus | Mycoplasmapneumoniae | CoronavirusHKU1 | Parainfluenza3 | Chlamydophilapneumoniae | Adenovirus | Parainfluenza4 | Coronavirus229E | CoronavirusOC43 | InfAH1N12009 | Bordetellapertussis | Metapneumovirus | Parainfluenza2 | Neutrophils | Urea | ProteinaCreativamgdL | Creatinine | Potassium | Sodium | InfluenzaBrapidtest | InfluenzaArapidtest | Alaninetransaminase | Aspartatetransaminase | Gammaglutamyltransferase | TotalBilirubin | DirectBilirubin | IndirectBilirubin | Alkalinephosphatase | Ionizedcalcium | StreptoA | Magnesium | pCO2venousbloodgasanalysis | Hbsaturationvenousbloodgasanalysis | Baseexcessvenousbloodgasanalysis | pO2venousbloodgasanalysis | Fio2venousbloodgasanalysis | TotalCO2venousbloodgasanalysis | pHvenousbloodgasanalysis | HCO3venousbloodgasanalysis | Rods | Segmented | Promyelocytes | Metamyelocytes | Myelocytes | Myeloblasts | UrineEsterase | UrineAspect | UrinepH | UrineHemoglobin | UrineBilepigments | UrineKetoneBodies | UrineNitrite | UrineDensity | UrineUrobilinogen | UrineProtein | UrineSugar | UrineLeukocytes | UrineCrystals | UrineRedbloodcells | UrineHyalinecylinders | UrineGranularcylinders | UrineYeasts | UrineColor | Partialthromboplastintime PTT | RelationshipPatientNormal | InternationalnormalizedratioINR | LacticDehydrogenase | ProthrombintimePTActivity | VitaminB12 | Creatinephosphokinase CPK | Ferritin | ArterialLacticAcid | Lipasedosage | DDimer | Albumin | Hbsaturationarterialbloodgases | pCO2arterialbloodgasanalysis | Baseexcessarterialbloodgasanalysis | pHarterialbloodgasanalysis | TotalCO2arterialbloodgasanalysis | HCO3arterialbloodgasanalysis | pO2arterialbloodgasanalysis | ArteiralFio2 | Phosphor | ctO2arterialbloodgasanalysis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4441 | 12 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | negative | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1603 | 1 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1206 | 10 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1586 | 6 | positive | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2730 | 16 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
cat_col = list(data.select_dtypes("object").columns)
# Printing the count of each unique value in each categorical column
for column in cat_col:
    print(data[column].value_counts())
    print("-" * 50)
Output (condensed) — value counts per categorical column:
- SARSCov2examresult: negative 5086, positive 558
- RespiratorySyncytialVirus: not_detected 1302, detected 52
- InfluenzaA: not_detected 1336, detected 18
- InfluenzaB: not_detected 1277, detected 77
- Parainfluenza1: not_detected 1349, detected 3
- CoronavirusNL63: not_detected 1307, detected 45
- RhinovirusEnterovirus: not_detected 973, detected 379
- CoronavirusHKU1: not_detected 1332, detected 20
- Parainfluenza3: not_detected 1342, detected 10
- Chlamydophilapneumoniae: not_detected 1343, detected 9
- Adenovirus: not_detected 1339, detected 13
- Parainfluenza4: not_detected 1333, detected 19
- Coronavirus229E: not_detected 1343, detected 9
- CoronavirusOC43: not_detected 1344, detected 8
- InfAH1N12009: not_detected 1254, detected 98
- Bordetellapertussis: not_detected 1350, detected 2
- Metapneumovirus: not_detected 1338, detected 14
- Parainfluenza2: not_detected 1352
- InfluenzaBrapidtest: negative 771, positive 49
- InfluenzaArapidtest: negative 768, positive 52
- StreptoA: negative 297, positive 34, not_done 1
- UrineEsterase: absent 59, not_done 1
- UrineAspect: clear 61, cloudy 5, lightly_cloudy 3, altered_coloring 1
- UrinepH: numeric strings from 5 to 8 (e.g. 5.0: 14, 6.5: 10, 7.0: 8), plus one "Não Realizado" (not performed)
- UrineHemoglobin: absent 53, present 16, not_done 1
- UrineBilepigments: absent 69, not_done 1
- UrineKetoneBodies: absent 56, not_done 1
- UrineNitrite: not_done 1
- UrineUrobilinogen: normal 68, not_done 1
- UrineProtein: absent 59, not_done 1
- UrineLeukocytes: numeric strings from "<1000" (9 rows) up to isolated large values such as 5942000
- UrineCrystals: Ausentes (absent) 65, plus a few amorphous urate / calcium oxalate entries
- UrineHyalinecylinders: absent 67
- UrineGranularcylinders: absent 69
- UrineYeasts: absent 70
- UrineColor: yellow 55, light_yellow 13, citrus_yellow 1, orange 1
cat_col
['SARSCov2examresult', 'RespiratorySyncytialVirus', 'InfluenzaA', 'InfluenzaB', 'Parainfluenza1', 'CoronavirusNL63', 'RhinovirusEnterovirus', 'CoronavirusHKU1', 'Parainfluenza3', 'Chlamydophilapneumoniae', 'Adenovirus', 'Parainfluenza4', 'Coronavirus229E', 'CoronavirusOC43', 'InfAH1N12009', 'Bordetellapertussis', 'Metapneumovirus', 'Parainfluenza2', 'InfluenzaBrapidtest', 'InfluenzaArapidtest', 'StreptoA', 'UrineEsterase', 'UrineAspect', 'UrinepH', 'UrineHemoglobin', 'UrineBilepigments', 'UrineKetoneBodies', 'UrineNitrite', 'UrineUrobilinogen', 'UrineProtein', 'UrineLeukocytes', 'UrineCrystals', 'UrineHyalinecylinders', 'UrineGranularcylinders', 'UrineYeasts', 'UrineColor']
# Columns where a positive finding is recorded as "detected"
# (note: NaN, i.e. test not performed, is also mapped to 0 here)
detected_cols = [
    "RespiratorySyncytialVirus",
    "InfluenzaA",
    "InfluenzaB",
    "Parainfluenza1",
    "CoronavirusNL63",
    "RhinovirusEnterovirus",
    "CoronavirusHKU1",
    "Parainfluenza3",
    "Chlamydophilapneumoniae",
    "Adenovirus",
    "Parainfluenza4",
    "Coronavirus229E",
    "CoronavirusOC43",
    "InfAH1N12009",
    "Bordetellapertussis",
    "Metapneumovirus",
    "Parainfluenza2",
]
for col in detected_cols:
    data[col] = data[col].apply(lambda x: 1 if x == "detected" else 0)
# Columns where a positive finding is recorded as "positive"
# (the rapid tests and StreptoA use negative/positive labels, so comparing
# against "detected" here would wrongly map every row to 0)
positive_cols = [
    "SARSCov2examresult",
    "InfluenzaBrapidtest",
    "InfluenzaArapidtest",
    "StreptoA",
]
for col in positive_cols:
    data[col] = data[col].apply(lambda x: 1 if x == "positive" else 0)
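One subtlety in this encoding: the lambda maps missing values (NaN) to 0 as well, so a test that was never performed is encoded the same as a negative result. A minimal sketch with a hypothetical toy column:

```python
import pandas as pd

# Toy column standing in for one of the viral-panel results (hypothetical values)
df = pd.DataFrame({"InfluenzaA": ["detected", "not_detected", None]})

# Same rule as above: "detected" -> 1; anything else, including NaN, -> 0
df["InfluenzaA"] = df["InfluenzaA"].apply(lambda x: 1 if x == "detected" else 0)
print(df["InfluenzaA"].tolist())  # [1, 0, 0]
```

Whether treating "not tested" as "not detected" is acceptable depends on how the model output will be used; it is worth keeping in mind when interpreting the results.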
data.sample(5, random_state=1)
(output: the same five sampled rows after encoding — the viral-panel, rapid-test, and StreptoA columns now hold 0/1 values, `SARSCov2examresult` is 1 only for row 1586, and the numeric laboratory columns remain NaN)
# function to create labeled barplots
def labeled_barplot(df, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    df: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(df[feature])  # length of the column
    count = df[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=df,
        x=feature,
        palette="Paired",
        order=df[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage share of each category level
        else:
            label = p.get_height()  # count of each category level
        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar's center
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage above the bar
    plt.show()  # show the plot
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; the star marks the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )  # histogram; seaborn chooses the bins automatically when bins is None
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
result = data.select_dtypes(include="number")
result.head()
(output: first five rows of the numeric columns; row 1 carries standardized laboratory values such as Hematocrit 0.237 and Hemoglobin -0.022, while the remaining rows are almost entirely NaN)
for feature in result.columns:
    histogram_boxplot(result, feature, figsize=(12, 7), kde=False, bins=None)
result1 = data.select_dtypes(include="object")
result1.head()
| UrineEsterase | UrineAspect | UrinepH | UrineHemoglobin | UrineBilepigments | UrineKetoneBodies | UrineNitrite | UrineUrobilinogen | UrineProtein | UrineLeukocytes | UrineCrystals | UrineHyalinecylinders | UrineGranularcylinders | UrineYeasts | UrineColor | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
for feature in result1.columns:
    labeled_barplot(result1, feature, perc=True)
The bar plots above show the percentage distribution of each categorical variable.
# correlation check
plt.figure(figsize=(41, 37))
sns.heatmap(
    data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".1f", cmap="Spectral"
)
plt.show()
The red boxes indicate a strong negative correlation between two variables: as one increases, the other decreases (e.g., ionized calcium and pH are negatively correlated).
The blue boxes indicate a strong positive correlation between a pair of variables: as one increases, the other also increases (e.g., pO2 and arterial FiO2).
Boxes with values greater than zero indicate a positive correlation between the pair of variables.
Boxes with values smaller than zero indicate a negative correlation between the pair of variables.
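With a heatmap this large, it also helps to rank variable pairs by absolute correlation programmatically. A sketch on a small synthetic frame (in the notebook, `corr` would come from `data.corr(numeric_only=True)`; the toy columns here are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: "b" is strongly positively, "c" strongly negatively related to "a"
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200)})
df["b"] = 2 * df["a"] + rng.normal(scale=0.1, size=200)
df["c"] = -df["a"] + rng.normal(scale=0.1, size=200)

corr = df.corr()
# Keep only the upper triangle so each pair appears once, then rank by |correlation|
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().abs().sort_values(ascending=False)
print(pairs.head(3))
```

This makes the "red and blue boxes" discussion reproducible instead of relying on visual inspection alone.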
plt.figure(figsize=(7, 6))
sns.barplot(data=data, x="SARSCov2examresult", y="VitaminB12", ci=False)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(7, 6))
sns.barplot(data=data, x="SARSCov2examresult", y="Hematocrit", ci=False)
plt.xticks(rotation=90)
plt.show()
The mean standardized hematocrit level is higher for COVID-positive patients.
plt.figure(figsize=(7, 6))
sns.barplot(data=data, x="SARSCov2examresult", y="Hemoglobin", ci=False)
plt.xticks(rotation=90)
plt.show()
The mean standardized hemoglobin level is higher for COVID-positive patients.
plt.figure(figsize=(7, 6))
sns.barplot(data=data, x="SARSCov2examresult", y="Platelets", ci=False)
plt.xticks(rotation=90)
plt.show()
The mean standardized platelet level is negative (below average) for COVID-positive patients.
plt.figure(figsize=(6, 6))
sns.barplot(
data=data, x="SARSCov2examresult", y="TotalCO2arterialbloodgasanalysis", ci=False
)
plt.xticks(rotation=90)
plt.show()
The mean standardized total CO2 (arterial blood gas analysis) is negative for COVID-positive patients.
plt.figure(figsize=(6, 6))
sns.barplot(data=data, x="SARSCov2examresult", y="RedbloodCells", ci=False)
plt.xticks(rotation=90)
plt.show()
The mean standardized red blood cell count is higher for COVID-positive patients.
plt.figure(figsize=(6, 6))
sns.barplot(data=data, x="SARSCov2examresult", y="Lymphocytes", ci=False)
plt.xticks(rotation=90)
plt.show()
The mean standardized lymphocyte count is positive (above average) for COVID-negative patients.
Similar plots can be shown for the other variables.
plt.figure(figsize=(6, 6))
sns.barplot(
data=data, x="SARSCov2examresult", y="Patientaddmitedtoregularward1yes0no", ci=False
)
plt.xticks(rotation=90)
plt.show()
The proportion of patients admitted to the regular ward is higher among COVID-positive patients.
plt.figure(figsize=(6, 6))
sns.barplot(data=data, x="SARSCov2examresult", y="Leukocytes", ci=False)
plt.xticks(rotation=90)
plt.show()
The mean standardized leukocyte count is negative (below average) for COVID-positive patients.
plt.figure(figsize=(7, 5))
sns.countplot(x="RespiratorySyncytialVirus", hue="SARSCov2examresult", data=data)
plt.xticks(rotation=60)
plt.legend(loc=1)
plt.show()
The count of patients with Respiratory Syncytial Virus not detected is high among COVID-negative patients.
plt.figure(figsize=(7, 5))
sns.countplot(x="RhinovirusEnterovirus", hue="SARSCov2examresult", data=data)
plt.xticks(rotation=60)
plt.legend(loc=1)
plt.show()
Rhinovirus/Enterovirus is detected in far more COVID-negative patients than COVID-positive patients.
plt.figure(figsize=(7, 5))
sns.countplot(x="InfluenzaB", hue="SARSCov2examresult", data=data)
plt.xticks(rotation=60)
plt.legend(loc=1)
plt.show()
Influenza B is not detected for the large majority of patients, particularly the COVID-negative ones.
plt.figure(figsize=(7, 5))
sns.countplot(x="UrinepH", hue="SARSCov2examresult", data=data)
plt.xticks(rotation=60)
plt.legend(loc=1)
plt.show()
At urine pH 7, the counts of COVID-negative and COVID-positive patients are about the same.
plt.figure(figsize=(7, 5))
sns.countplot(x="Patientagequantile", hue="SARSCov2examresult", data=data)
plt.xticks(rotation=60)
plt.legend(loc=1)
plt.show()
Across the age quantiles (0-19), patients in the lower quantiles contracted COVID less often.
plt.figure(figsize=(6, 6))
sns.barplot(
data=data, y="Leukocytes", x="Patientaddmitedtoregularward1yes0no", ci=False
)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(6, 6))
sns.barplot(
data=data, y="Leukocytes", x="Patientaddmitedtosemiintensiveunit1yes0no", ci=False
)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(6, 6))
sns.barplot(
data=data, y="Leukocytes", x="Patientaddmitedtointensivecareunit1yes0no", ci=False
)
plt.xticks(rotation=90)
plt.show()
The plots show that the mean standardized leukocyte count is negative for regular-ward patients, while for more serious (semi-intensive and intensive care) patients it is positive.
plt.figure(figsize=(6, 6))
sns.barplot(data=data, y="Platelets", x="Patientaddmitedtoregularward1yes0no", ci=False)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(6, 6))
sns.barplot(
data=data, y="Platelets", x="Patientaddmitedtosemiintensiveunit1yes0no", ci=False
)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(6, 6))
sns.barplot(
data=data, y="Platelets", x="Patientaddmitedtointensivecareunit1yes0no", ci=False
)
plt.xticks(rotation=90)
plt.show()
The mean standardized platelet count is negative for regular-ward patients.
plt.figure(figsize=(6, 6))
sns.barplot(
data=data, y="Hematocrit", x="Patientaddmitedtoregularward1yes0no", ci=False
)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(6, 6))
sns.barplot(
data=data, y="Hematocrit", x="Patientaddmitedtosemiintensiveunit1yes0no", ci=False
)
plt.xticks(rotation=90)
plt.show()
plt.figure(figsize=(6, 6))
sns.barplot(
data=data, y="Hematocrit", x="Patientaddmitedtointensivecareunit1yes0no", ci=False
)
plt.xticks(rotation=90)
plt.show()
The mean standardized hematocrit level is strongly negative for patients in the intensive care unit.
Similar plots can be drawn for the other variables.
EDA Conclusion:
1. The EDA shows how different laboratory factors vary between COVID-positive and COVID-negative patients.
2. We have seen how these factor levels vary for patients admitted to the different wards.
3. Based on these plots, we can identify how critical a patient is, reserve beds for critical patients, and save more lives.
# Separating target variable and other variables
X = data.drop(columns="SARSCov2examresult")
X = pd.get_dummies(X)
Y = data["SARSCov2examresult"]
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, Y, test_size=0.2, random_state=1, stratify=Y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(3386, 166) (1129, 166) (1129, 166)
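The two-step split yields a 60/20/20 train/validation/test partition: 20% is first held out for test, and then 25% of the remaining 80% (i.e. 20% of the whole) becomes validation. A small sketch on toy data (the array sizes here are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)  # imbalanced classes, as in the COVID data

# Step 1: hold out 20% for test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
# Step 2: 25% of the remaining 80% -> 20% of the whole for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
# stratify preserves the 80/20 class ratio in every split
print(y_train.mean(), y_val.mean(), y_test.mean())  # 0.2 0.2 0.2
```

Stratifying both splits matters here because the positive class is only about 10% of the data; an unstratified split could leave a fold with very few positives.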
from sklearn.impute import SimpleImputer  # for median imputation (import here if not already imported above)

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)  # fit on the training data only, so no information leaks from validation/test
X_train = imputer.transform(X_train)
X_val = imputer.transform(X_val)
X_test = imputer.transform(X_test)
X_train.shape
(3386, 160)
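Two things are worth noting about the imputation step: `SimpleImputer` drops columns that contain no observed values at all in the fit data (which is why the column count falls from 166 to 160 above), and `transform` returns a NumPy array, so the DataFrame column names are lost. A minimal sketch with hypothetical toy values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training frame: column "b" has no observed values at all
X_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, np.nan]})

imputer = SimpleImputer(strategy="median")
arr = imputer.fit_transform(X_train)  # fit on train only, to avoid leakage

print(arr.shape)           # (3, 1): the all-NaN column "b" is dropped
print(arr[:, 0].tolist())  # [1.0, 2.0, 3.0]: NaN in "a" replaced by the median 2.0
```

If the column names are needed later (e.g. for feature-importance plots), the array can be wrapped back into a DataFrame with the surviving columns.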
There are two types of prediction errors here:
1. False negative: predicting that a patient is COVID-negative when they are actually positive.
2. False positive: predicting that a patient is COVID-positive when they are actually negative.
Both cases are costly:
If we predict a negative result but the patient actually has COVID, their health can worsen untreated and they can spread the virus further.
If we predict a positive result for a patient who does not have COVID, beds and treatment are occupied unnecessarily, genuinely positive patients may not receive care, and the number of deaths can rise rapidly.
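Both error types can be read directly off the confusion matrix. A minimal sketch with hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]  # actual test results (1 = positive)
y_pred = [1, 0, 0, 1, 1, 0]  # model predictions

# sklearn lays the binary matrix out as [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fn, fp)  # 1 1 -> one missed positive, one false alarm
```

Because both error types matter here, F1 (which balances precision and recall) is a sensible metric to optimize.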
Let's define a function to output different metrics (including f1) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # classifying observations whose predicted probability exceeds the threshold as class 1
    pred = model.predict(predictors) > threshold
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
y_pred = model.predict(predictors) > threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Cost:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation Cost:

Logistic regression: 0.1010219915987151
Bagging: 0.09477211013830732
Random forest: 0.04588701861638716
GBM: 0.11650522465230469
Adaboost: 0.10211208611208611
Xgboost: 0.11697721685891513
dtree: 0.11507436639858164

Validation Performance:

Logistic regression: 0.13392857142857142
Bagging: 0.07142857142857142
Random forest: 0.03571428571428571
GBM: 0.07142857142857142
Adaboost: 0.07142857142857142
Xgboost: 0.11607142857142858
dtree: 0.11607142857142858
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
We have seen the performance of the models on the original (imbalanced) data; let's see whether oversampling the data can help improve it.
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, counts of label '1': 334
Before OverSampling, counts of label '0': 3052

After OverSampling, counts of label '1': 3052
After OverSampling, counts of label '0': 3052

After OverSampling, the shape of train_X: (6104, 160)
After OverSampling, the shape of train_y: (6104,)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results2 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
)
results2.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_over, y_train_over)
scores = f1_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation Performance:

Logistic regression: 0.6368448283318868
Bagging: 0.729359612386283
Random forest: 0.7329955714207665
GBM: 0.7369305467656047
Adaboost: 0.7323342047576953
Xgboost: 0.7312771598565031
dtree: 0.7261444145026816

Validation Performance:

Logistic regression: 0.2345679012345679
Bagging: 0.24275862068965517
Random forest: 0.24225352112676055
GBM: 0.23907455012853468
Adaboost: 0.22766217870257038
Xgboost: 0.24209078404401652
dtree: 0.23561643835616441
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results2)
ax.set_xticklabels(names)
plt.show()
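One caveat with the oversampled CV numbers: SMOTE was applied to the whole training set before cross-validation, so synthetic minority samples derived from one fold can appear in another fold, which tends to inflate CV scores relative to the held-out validation set. A safer pattern is to oversample inside each fold only. A minimal sketch on toy data, using naive random oversampling as a stand-in for SMOTE:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (rng.random(200) < 0.15).astype(int)  # imbalanced toy labels

def oversample(X_part, y_part, rng):
    # naive random oversampling of the minority class (stand-in for SMOTE)
    minority = np.where(y_part == 1)[0]
    majority = np.where(y_part == 0)[0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X_part[idx], y_part[idx]

fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=1).split(X, y):
    # oversampling happens inside the fold, after the split
    X_tr, y_tr = oversample(X[train_idx], y[train_idx], rng)
    model = LogisticRegression().fit(X_tr, y_tr)
    fold_scores.append(f1_score(y[test_idx], model.predict(X[test_idx]), zero_division=0))
print(np.mean(fold_scores))
```

With imbalanced-learn installed, the same effect can be had by putting `SMOTE` and the estimator into an `imblearn.pipeline.Pipeline` and cross-validating the pipeline.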
Oversampling improved the cross-validation scores substantially, though the validation scores improved much less. Now let's see how the models perform with undersampled data.
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before UnderSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UnderSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
print("After UnderSampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After UnderSampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))
print("After UnderSampling, the shape of train_X: {}".format(X_train_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before UnderSampling, counts of label '1': 334
Before UnderSampling, counts of label '0': 3052

After UnderSampling, counts of label '1': 334
After UnderSampling, counts of label '0': 334

After UnderSampling, the shape of train_X: (668, 160)
After UnderSampling, the shape of train_y: (668,)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(
("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss", n_jobs=-1))
)
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold, n_jobs=-1
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_un, y_train_un)
scores = f1_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation Performance:

Logistic regression: 0.5926654419511507
Bagging: 0.6136065812326101
Random forest: 0.6432519546058566
GBM: 0.6825714531626661
Adaboost: 0.6700922079881821
Xgboost: 0.6377721250662427
dtree: 0.6281974830976645

Validation Performance:

Logistic regression: 0.2091503267973856
Bagging: 0.23463687150837986
Random forest: 0.23746701846965698
GBM: 0.24029126213592233
Adaboost: 0.23826714801444043
Xgboost: 0.237597911227154
dtree: 0.22908622908622908
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names)
plt.show()
After looking at the performance of all the models, let's decide which ones could improve further with hyperparameter tuning.
%%time
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting RandomizedSearchCV on the oversampled training data
randomized_cv.fit(X_train_over, y_train_over)
print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
Best parameters are {'n_estimators': 200, 'learning_rate': 0.05, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.7399222256071394:
CPU times: user 4.49 s, sys: 76.2 ms, total: 4.56 s
Wall time: 1min 4s
# Creating new pipeline with best parameters
tuned_ada = AdaBoostClassifier(
n_estimators=200,
learning_rate=0.05,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
tuned_ada.fit(X_train_over, y_train_over)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.05, n_estimators=200)
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
ada_train_perf = model_performance_classification_sklearn(
tuned_ada, X_train_over, y_train_over
)
ada_train_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.686 | 0.925 | 0.626 | 0.746 |
ada_val_perf = model_performance_classification_sklearn(tuned_ada, X_val, y_val)
ada_val_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.487 | 0.777 | 0.136 | 0.231 |
%%time
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting RandomizedSearchCV on the undersampled training data
randomized_cv.fit(X_train_un, y_train_un)
print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
Best parameters are {'n_estimators': 100, 'learning_rate': 0.05, 'base_estimator': DecisionTreeClassifier(max_depth=1, random_state=1)} with CV score=0.7138602837468871:
CPU times: user 310 ms, sys: 31.5 ms, total: 342 ms
Wall time: 6.46 s
# Creating new pipeline with best parameters
tuned_ada = AdaBoostClassifier(
n_estimators=100,
learning_rate=0.05,
base_estimator=DecisionTreeClassifier(max_depth=1, random_state=1),
)
tuned_ada.fit(X_train_un, y_train_un)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1,
random_state=1),
learning_rate=0.05, n_estimators=100)
ada_train_perf_1 = model_performance_classification_sklearn(
tuned_ada, X_train_un, y_train_un
)
ada_train_perf_1
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.645 | 0.955 | 0.590 | 0.729 |
ada_val_perf_1 = model_performance_classification_sklearn(tuned_ada, X_val, y_val)
ada_val_perf_1
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.358 | 0.938 | 0.128 | 0.225 |
Overfitting is reduced: the validation recall (0.938) is now much closer to the training recall (0.955), though precision remains low.
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
"max_depth": np.arange(2, 6),
"min_samples_leaf": [1, 4, 7],
"max_leaf_nodes": [10, 15],
"min_impurity_decrease": [0.0001, 0.001],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model,
param_distributions=param_grid,
n_iter=10,
n_jobs=-1,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 10, 'max_depth': 5} with CV score=0.7234591364185811:
# Building the decision tree with chosen parameters (these differ slightly from the CV best parameters above)
Decision_tree = DecisionTreeClassifier(
min_samples_leaf=7,
random_state=1,
min_impurity_decrease=0.0001,
max_leaf_nodes=15,
max_depth=5,
)
Decision_tree.fit(
X_train_over, y_train_over
)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=15,
min_impurity_decrease=0.0001, min_samples_leaf=7,
random_state=1)
Decision_tree_train_perf = model_performance_classification_sklearn(
Decision_tree, X_train_over, y_train_over
)
Decision_tree_train_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.625 | 0.992 | 0.572 | 0.726 |
Decision_tree_val_perf = model_performance_classification_sklearn(
Decision_tree, X_val, y_val
)
Decision_tree_val_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.307 | 0.946 | 0.120 | 0.213 |
The model is still overfitting: the validation recall and F1 remain below the training recall and F1 with the oversampled data.
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
"max_depth": np.arange(2, 20),
"min_samples_leaf": [1, 2, 5, 7],
"max_leaf_nodes": [5, 10, 15],
"min_impurity_decrease": [0.0001, 0.001],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model,
param_distributions=param_grid,
n_iter=10,
n_jobs=-1,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 5, 'max_depth': 2} with CV score=0.7073961958268696:
# Creating new pipeline with best parameters
Decision_tree_1 = DecisionTreeClassifier(
min_samples_leaf=1,
random_state=1,
min_impurity_decrease=0.001,
max_leaf_nodes=5,
max_depth=2,
)
Decision_tree_1.fit(
X_train_un, y_train_un
)
DecisionTreeClassifier(max_depth=2, max_leaf_nodes=5,
min_impurity_decrease=0.001, random_state=1)
Decision_tree_train_perf_1 = model_performance_classification_sklearn(
Decision_tree_1, X_train_un, y_train_un
)
Decision_tree_train_perf_1
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.609 | 0.985 | 0.562 | 0.716 |
Decision_tree_val_perf_1 = model_performance_classification_sklearn(
Decision_tree_1, X_val, y_val
)
Decision_tree_val_perf_1
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.279 | 0.955 | 0.117 | 0.208 |
The validation recall has improved compared to the decision tree tuned on the oversampled data.
# defining model
Model = BaggingClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
"max_samples": [0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9],
"n_estimators": [30, 50, 70],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model,
param_distributions=param_grid,
n_iter=50,
n_jobs=-1,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(
X_train_over, y_train_over
)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
Best parameters are {'n_estimators': 50, 'max_samples': 0.9, 'max_features': 0.7} with CV score=0.7331281517827823:
# Building the bagging classifier with chosen parameters (note: max_samples differs from the CV best value and random_state is not set)
tuned_bag2 = BaggingClassifier(max_features=0.7, max_samples=0.8, n_estimators=50)
tuned_bag2.fit(
X_train_over, y_train_over
)
BaggingClassifier(max_features=0.7, max_samples=0.8, n_estimators=50)
bag2_train_perf = model_performance_classification_sklearn(
tuned_bag2, X_train_over, y_train_over
)
bag2_train_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.689 | 0.879 | 0.637 | 0.739 |
bag2_val_perf = model_performance_classification_sklearn(
tuned_bag2, X_val, y_val
)
bag2_val_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.523 | 0.795 | 0.147 | 0.249 |
Compared to the decision tree model, the validation recall decreases while the validation F1 increases slightly; the model is still overfitting.
# defining model
Model = BaggingClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
"max_samples": [0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9],
"n_estimators": [30, 50, 70],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model,
param_distributions=param_grid,
n_iter=50,
n_jobs=-1,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(
X_train_un, y_train_un
)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
Best parameters are {'n_estimators': 50, 'max_samples': 0.8, 'max_features': 0.9} with CV score=0.645098535794715:
# Building the bagging classifier with chosen parameters (these differ slightly from the CV best parameters above)
tuned_bag3 = BaggingClassifier(
max_features=0.7, random_state=1, max_samples=0.9, n_estimators=50
)
tuned_bag3.fit(
X_train_un, y_train_un
)
BaggingClassifier(max_features=0.7, max_samples=0.9, n_estimators=50,
random_state=1)
bag3_train_perf = model_performance_classification_sklearn(
tuned_bag3, X_train_un, y_train_un
)
bag3_train_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.690 | 0.769 | 0.664 | 0.713 |
bag3_val_perf = model_performance_classification_sklearn(
tuned_bag3, X_val, y_val
)
bag3_val_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.533 | 0.741 | 0.143 | 0.240 |
Overfitting is reduced with the undersampled data.
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
"n_estimators": [200, 250, 300],
"min_samples_leaf": np.arange(1, 4),
"max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],  # flat list of candidate values
"max_samples": np.arange(0.4, 0.7, 0.1),
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model,
param_distributions=param_grid,
n_iter=50,
n_jobs=-1,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(
X_train_over, y_train_over
)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 2, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.7402405743208096:
# Building the random forest with chosen parameters (note: n_estimators differs from the CV best value above)
tuned_rf2 = RandomForestClassifier(
max_features="sqrt",
random_state=1,
max_samples=0.6,
n_estimators=200,
min_samples_leaf=2,
)
tuned_rf2.fit(
X_train_over, y_train_over
)
RandomForestClassifier(max_features='sqrt', max_samples=0.6, min_samples_leaf=2,
n_estimators=200, random_state=1)
rf2_train_perf = model_performance_classification_sklearn(
tuned_rf2, X_train_over, y_train_over
)
rf2_train_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.660 | 0.990 | 0.596 | 0.744 |
rf2_val_perf = model_performance_classification_sklearn(
tuned_rf2, X_val, y_val
)
rf2_val_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.371 | 0.848 | 0.121 | 0.211 |
The model is overfitting on the training data.
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
"n_estimators": [200, 250, 300],
"min_samples_leaf": np.arange(1, 4),
"max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],  # flat list of candidate values
"max_samples": np.arange(0.4, 0.7, 0.1),
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model,
param_distributions=param_grid,
n_iter=50,
n_jobs=-1,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(
X_train_un, y_train_un
)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
Best parameters are {'n_estimators': 200, 'min_samples_leaf': 2, 'max_samples': 0.5, 'max_features': 'sqrt'} with CV score=0.6964227769936958:
# Building the random forest with chosen parameters (note: n_estimators differs from the CV best value above)
tuned_rf3 = RandomForestClassifier(
max_features="sqrt",
random_state=1,
max_samples=0.5,
n_estimators=300,
min_samples_leaf=2,
)
tuned_rf3.fit(
X_train_un, y_train_un
)
RandomForestClassifier(max_features='sqrt', max_samples=0.5, min_samples_leaf=2,
n_estimators=300, random_state=1)
rf3_train_perf = model_performance_classification_sklearn(
tuned_rf3, X_train_un, y_train_un
)
rf3_train_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.653 | 0.955 | 0.595 | 0.733 |
rf3_val_perf = model_performance_classification_sklearn(
tuned_rf3, X_val, y_val
)
rf3_val_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.358 | 0.938 | 0.128 | 0.225 |
The overfitting is reduced.
%%time
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(100, 150, 25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample": [0.5, 0.7],
    "max_features": [0.5, 0.7],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    scoring=scorer,
    n_iter=50,
    n_jobs=-1,
    cv=5,
    random_state=1,
)
# Fitting RandomizedSearchCV on the oversampled training data
randomized_cv.fit(X_train_over, y_train_over)
print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.5, 'learning_rate': 0.05} with CV score=0.7357013576995698:
CPU times: user 1.16 s, sys: 61.1 ms, total: 1.22 s
Wall time: 25.5 s
# Building the GBM with chosen parameters (note: max_features differs from the CV best value, and random_state is not set)
tuned_gbm = GradientBoostingClassifier(
    max_features=0.7,
    subsample=0.7,
    n_estimators=125,
    learning_rate=0.05,
)
tuned_gbm.fit(X_train_over, y_train_over)
GradientBoostingClassifier(learning_rate=0.05, max_features=0.7,
n_estimators=125, subsample=0.7)
gbm_train_perf = model_performance_classification_sklearn(
tuned_gbm, X_train_over, y_train_over
)
gbm_train_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.676 | 0.915 | 0.619 | 0.739 |
gbm_val_perf = model_performance_classification_sklearn(tuned_gbm, X_val, y_val)
gbm_val_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.476 | 0.848 | 0.142 | 0.243 |
The overfitting is reduced compared to previous models.
%%time
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(100, 150, 25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample": [0.5, 0.7],
    "max_features": [0.5, 0.7],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    scoring=scorer,
    n_iter=50,
    n_jobs=-1,
    cv=5,
    random_state=1,
)
# Fitting RandomizedSearchCV on the undersampled training data
randomized_cv.fit(X_train_un, y_train_un)
print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.5, 'learning_rate': 0.05} with CV score=0.6996007233854158:
CPU times: user 255 ms, sys: 25.4 ms, total: 280 ms
Wall time: 2.7 s
# Building the GBM with chosen parameters (these differ slightly from the CV best parameters above)
tuned_gbm_1 = GradientBoostingClassifier(
    max_features=0.7,
    random_state=1,
    subsample=0.5,
    n_estimators=100,
    learning_rate=0.05,
)
tuned_gbm_1.fit(X_train_un, y_train_un)
GradientBoostingClassifier(learning_rate=0.05, max_features=0.7, random_state=1,
subsample=0.5)
gbm_train_perf_1 = model_performance_classification_sklearn(
tuned_gbm_1, X_train_un, y_train_un
)
gbm_train_perf_1
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.680 | 0.904 | 0.624 | 0.738 |
gbm_val_perf_1 = model_performance_classification_sklearn(tuned_gbm_1, X_val, y_val)
gbm_val_perf_1
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.445 | 0.884 | 0.139 | 0.240 |
The model generalizes slightly better with the undersampled data, with a higher validation recall.
%%time
# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": [150, 200, 250],
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting RandomizedSearchCV on the oversampled training data
randomized_cv.fit(X_train_over, y_train_over)
print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
Best parameters are {'subsample': 0.8, 'scale_pos_weight': 5, 'n_estimators': 250, 'learning_rate': 0.1, 'gamma': 0} with CV score=0.7423770447207307:
CPU times: user 22.3 s, sys: 140 ms, total: 22.4 s
Wall time: 8min 59s
# Building the XGBoost model with chosen parameters (these differ from the CV best parameters above)
xgb2 = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=0.9,
scale_pos_weight=5,
n_estimators=200,
learning_rate=0.2,
gamma=0,
)
xgb2.fit(
X_train_over, y_train_over
)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
eval_metric='logloss', gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.2, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=200, n_jobs=8,
num_parallel_tree=1, predictor='auto', random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=5, subsample=0.9,
tree_method='exact', validate_parameters=1, verbosity=None)
xgb2_train_perf = model_performance_classification_sklearn(
xgb2, X_train_over, y_train_over
)
xgb2_train_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.663 | 0.998 | 0.598 | 0.748 |
xgb2_val_perf = model_performance_classification_sklearn(
xgb2, X_val, y_val
)
xgb2_val_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.368 | 0.893 | 0.125 | 0.219 |
XGBoost performs better than the previous models on the training data, but it is still overfitting.
%%time
# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": [150, 200, 250],
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting RandomizedSearchCV on the undersampled training data
randomized_cv.fit(X_train_un, y_train_un)
print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.2, 'gamma': 5} with CV score=0.7174464115340762:
CPU times: user 2.93 s, sys: 111 ms, total: 3.04 s
Wall time: 1min 55s
# Building the XGBoost model with chosen parameters (these differ from the CV best parameters above)
xgb2_1 = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=0.9,
scale_pos_weight=5,
n_estimators=250,
learning_rate=0.2,
gamma=3,
)
xgb2_1.fit(
X_train_un, y_train_un
)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
eval_metric='logloss', gamma=3, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.2, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=250, n_jobs=8,
num_parallel_tree=1, predictor='auto', random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=5, subsample=0.9,
tree_method='exact', validate_parameters=1, verbosity=None)
xgb2_train_perf_1 = model_performance_classification_sklearn(
xgb2_1, X_train_un, y_train_un
) ## Complete the code to check the performance on undersampled train set
xgb2_train_perf_1
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.638 | 0.991 | 0.581 | 0.732 |
xgb2_val_perf_1 = model_performance_classification_sklearn(
xgb2_1, X_val, y_val
) ## Complete the code to check the performance on validation set
xgb2_val_perf_1
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.303 | 0.964 | 0.121 | 0.215 |
Overfitting is reduced compared to the XGBoost model tuned on oversampled data: the gap between training and validation recall is much smaller.
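One way to make the "less overfitting" claim concrete is to look at the train-validation recall gap. A minimal sketch, with the scores hardcoded from the performance tables above (an assumption for illustration; they are not recomputed here):

```python
# Sketch: quantify overfitting as the train-validation recall gap.
# The scores below are copied from the performance tables above
# (assumption: illustrative only, not recomputed from the fitted models).
def recall_gap(train_recall, val_recall):
    """Absolute gap between training and validation recall."""
    return round(abs(train_recall - val_recall), 3)

gap_oversampled = recall_gap(0.998, 0.893)   # xgb2, tuned on oversampled data
gap_undersampled = recall_gap(0.991, 0.964)  # xgb2_1, tuned on undersampled data

# The smaller gap suggests less overfitting on the undersampled model
print(gap_oversampled, gap_undersampled)
```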
We have now tuned all the models, let's compare the performance of all tuned models and see which one is the best.
# training performance comparison
models_train_comp_df = pd.concat(
[
ada_train_perf.T,
Decision_tree_train_perf.T,
gbm_train_perf.T,
xgb2_train_perf.T,
bag2_train_perf.T,
rf2_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Adaboost tuned with oversampled data",
"Decision Tree tuned with oversampled data",
"Gradient Boosting tuned with oversampled data",
"XGBoost tuned with oversampled data",
"Bagging classifier tuned with oversampled data",
"Random forest tuned with oversampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Adaboost tuned with oversampled data | Decision Tree tuned with oversampled data | Gradient Boosting tuned with oversampled data | XGBoost tuned with oversampled data | Bagging classifier tuned with oversampled data | Random forest tuned with oversampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 0.686 | 0.625 | 0.676 | 0.663 | 0.689 | 0.660 |
| Recall | 0.925 | 0.992 | 0.915 | 0.998 | 0.879 | 0.990 |
| Precision | 0.626 | 0.572 | 0.619 | 0.598 | 0.637 | 0.596 |
| F1 | 0.746 | 0.726 | 0.739 | 0.748 | 0.739 | 0.744 |
# training performance comparison
models_train_comp_df = pd.concat(
[
ada_train_perf_1.T,
Decision_tree_train_perf_1.T,
gbm_train_perf_1.T,
xgb2_train_perf_1.T,
bag3_train_perf.T,
rf3_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Adaboost tuned with undersampled data",
"Decision Tree tuned with undersampled data",
"Gradient Boosting tuned with undersampled data",
"XGBoost tuned with undersampled data",
"Bagging classifier tuned with undersampled data",
"Random forest tuned with undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Adaboost tuned with undersampled data | Decision Tree tuned with undersampled data | Gradient Boosting tuned with undersampled data | XGBoost tuned with undersampled data | Bagging classifier tuned with undersampled data | Random forest tuned with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 0.645 | 0.609 | 0.680 | 0.638 | 0.690 | 0.653 |
| Recall | 0.955 | 0.985 | 0.904 | 0.991 | 0.769 | 0.955 |
| Precision | 0.590 | 0.562 | 0.624 | 0.581 | 0.664 | 0.595 |
| F1 | 0.729 | 0.716 | 0.738 | 0.732 | 0.713 | 0.733 |
# Validation performance comparison
models_val_comp_df = pd.concat(
[
ada_val_perf.T,
Decision_tree_val_perf.T,
gbm_val_perf.T,
xgb2_val_perf.T,
bag2_val_perf.T,
rf2_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Adaboost tuned with oversampled data",
"Decision Tree tuned with oversampled data",
"Gradient Boosting tuned with oversampled data",
"XGBoost tuned with oversampled data",
"Bagging classifier tuned with oversampled data",
"Random forest tuned with oversampled data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Adaboost tuned with oversampled data | Decision Tree tuned with oversampled data | Gradient Boosting tuned with oversampled data | XGBoost tuned with oversampled data | Bagging classifier tuned with oversampled data | Random forest tuned with oversampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 0.487 | 0.307 | 0.476 | 0.368 | 0.523 | 0.371 |
| Recall | 0.777 | 0.946 | 0.848 | 0.893 | 0.795 | 0.848 |
| Precision | 0.136 | 0.120 | 0.142 | 0.125 | 0.147 | 0.121 |
| F1 | 0.231 | 0.213 | 0.243 | 0.219 | 0.249 | 0.211 |
We want recall maximized, i.e., false negatives minimized, so that infected patients are not sent home to spread the disease. We also want precision maximized, i.e., false positives minimized, so that hospital beds are not occupied by patients who are not actually infected, which reduces avoidable deaths. The decision tree model gives the highest recall score, but its precision is lower than that of the gradient boosting model, which performs well on both recall and precision. So, we will choose gradient boosting tuned with oversampled data as the final model.
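The selection rule above can be expressed programmatically. A minimal sketch with the scores copied from the validation table (the 0.84 recall floor is an illustrative assumption reflecting the priority on minimizing false negatives):

```python
# Sketch: among models meeting a recall floor, pick the best precision.
# Scores are copied from the validation comparison table above; the 0.84
# recall floor is an illustrative assumption, not part of the original analysis.
val_scores = {
    "Adaboost": {"recall": 0.777, "precision": 0.136},
    "Decision Tree": {"recall": 0.946, "precision": 0.120},
    "Gradient Boosting": {"recall": 0.848, "precision": 0.142},
    "XGBoost": {"recall": 0.893, "precision": 0.125},
    "Bagging": {"recall": 0.795, "precision": 0.147},
    "Random Forest": {"recall": 0.848, "precision": 0.121},
}

# Keep only models that catch enough positives, then maximize precision
candidates = {m: s for m, s in val_scores.items() if s["recall"] >= 0.84}
best = max(candidates, key=lambda m: candidates[m]["precision"])
print(best)
```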
Now that we have our final model, let's find out how it performs on unseen test data.
gbm_test = model_performance_classification_sklearn(tuned_gbm, X_test, y_test)
gbm_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.438 | 0.839 | 0.132 | 0.229 |
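The test-set recall and precision above come directly from the confusion matrix. A self-contained sketch on toy labels (an assumption for illustration, not the project's test set) showing the relationship:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score, precision_score

# Toy labels for illustration only (assumption: not the project's data)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 1, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)     # fraction of true positives that were caught
precision = tp / (tp + fp)  # fraction of positive predictions that are correct

# These hand-computed values match sklearn's metric functions
assert recall == recall_score(y_true, y_pred)
assert precision == precision_score(y_true, y_pred)
print(tp, fp, fn, recall, precision)
```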
feature_names = X.columns
importances = tuned_gbm.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 60))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
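The top features can also be read off programmatically instead of visually from the plot. A sketch with a toy importance array (the names echo the fitted model's leading features, but the values are hypothetical):

```python
import numpy as np

# Hypothetical names and importances for illustration only;
# in the notebook these would come from tuned_gbm.feature_importances_
feature_names = np.array(
    ["Patient_age_quantile", "Rhinovirus/Enterovirus", "Platelets", "Leukocytes", "Eosinophils"]
)
importances = np.array([0.30, 0.20, 0.12, 0.10, 0.05])

top_k = 3
top_idx = np.argsort(importances)[::-1][:top_k]  # indices of the largest importances
top_features = list(feature_names[top_idx])
print(top_features)
```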
Model = Pipeline(
steps=[
# ("imputer", SimpleImputer(strategy="median")),
(
"GradientBoosting Classifier",
GradientBoostingClassifier(
max_features=0.7,
random_state=1,
subsample=0.7,
n_estimators=125,
learning_rate=0.05,
),
),
]
)
Model.fit(X_train, y_train)
Pipeline(steps=[('GradientBoosting Classifier',
GradientBoostingClassifier(learning_rate=0.05,
max_features=0.7, n_estimators=125,
random_state=1, subsample=0.7))])
Model_test = model_performance_classification_sklearn(Model, X_test, y_test)
Model_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.903 | 0.062 | 0.636 | 0.114 |
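The pipeline above reaches 0.903 accuracy but only 0.062 recall, a reminder that accuracy is misleading under class imbalance: a model can score well by mostly predicting the majority class. A toy sketch with synthetic labels (an assumption for illustration):

```python
# Sketch: on imbalanced data, a trivial majority-class predictor gets high
# accuracy while catching zero positives (synthetic labels, illustration only).
y_true = [0] * 90 + [1] * 10  # 10% positive class
y_pred = [0] * 100            # always predict the majority (negative) class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)
print(accuracy, recall)
```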
Our analysis shows that the gradient boosting model gives generalized performance with a high recall score, which is needed to minimize false negatives. If we predict that a test result is negative but the patient actually has COVID, the patient's health may worsen and the patient may spread the virus widely.
The model also achieves relatively higher precision, i.e., fewer false positives, which helps reduce deaths and preserve hospital-bed availability. If we predict that a test result is positive for a patient who does not actually have COVID, beds and treatment may be taken away from genuinely positive patients, and the number of deaths could rise rapidly.
Patient_age_quantile is the most important feature, followed by Rhinovirus/Enterovirus, Patient admitted to regular ward, Platelets, and Leukocytes.
This model can be used to predict whether a patient is likely to test positive for COVID, helping to identify and treat genuinely positive patients early and to plan hospital-bed availability.
Patients with an age quantile below 19 have a lower chance of contracting the disease. The Platelets and Leukocytes levels of regular-ward patients are negative (standardized values); when these turn positive, it indicates the patient may need intensive care and special treatment.